Extracting structure from human-readable semistructured text

Author

  • Elaine Angelino
Abstract

The explosion of Web data over the past fifteen years has fostered a rich body of research in extracting structure from semistructured HTML and XML documents. Today, we are also in the middle of an explosion of semistructured documents that originate from outside the Web domain. This latter kind of semistructured data is everywhere: in electronic medical records (EMRs), government reports, digital humanities archives, and datasets from many other domains. In contrast to HTML or XML, semistructured text that has not been marked up is typically human-readable (HR), and its structure implicitly reflects a schema. Our high-level goals are to (1) explicitly recover meaningful structure in semistructured text corpora, and (2) demonstrate that this effort enables more accurate analytics and facilitates or enriches other applications. To motivate our approach and work, we highlight the differences between HR and marked-up semistructured text and illustrate concrete use cases of HR semistructured text. We identify specific instances of implicit structure along a spectrum of “structuredness” commonly found in HR semistructured text and describe a corresponding array of methods for extracting structured features. Our goal has not been to develop “the best” such methods, but instead to present a principled framework for evaluating and combining multiple extraction methods in the context of specific data analytic tasks. We also present example applications that can be quickly built on top of a substrate of features extracted from semistructured text. For concreteness, we focus this report on semistructured text found in EMRs and evaluate our framework using a large corpus of text reports from real EMRs exhibiting highly heterogeneous schemas.


Similar resources

Statistical Augmentation of a Chinese Machine-Readable Dictionary

We describe a method of using statistically-collected Chinese character groups from a corpus to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. Output was evaluated both using human evaluators and against a previously available dictionary. We also evaluated performance improv...


Exploratory Relation Extraction in Large Text Corpora

In this paper, we propose and demonstrate Exploratory Relation Extraction (ERE), a novel approach to identifying and extracting relations from large text corpora based on user-driven and data-guided incremental exploration. We draw upon ideas from the information seeking paradigm of Exploratory Search (ES) to enable an exploration process in which users begin with a vaguely defined information ...


Extraction of protein interaction information from unstructured text using a context-free grammar

MOTIVATION: As research into disease pathology and cellular function continues to generate vast amounts of data pertaining to protein, gene and small molecule (PGSM) interactions, there exists a critical need to capture these results in structured formats allowing for computational analysis. Although many efforts have been made to create databases that store this information in computer readable...


A Robust Practical Text Summarization

We present an automated method of generating human-readable summaries from text documents such as news, technical reports, government documents, and even court records. Our approach exploits an empirical observation that much written text displays certain regularities of organization and style, which we call the Discourse Macro Structure (DMS). A summary is therefore created to reflect th...


Extracting Transliteration Pairs from Comparable Corpora

Transliterating words and names from one language to another is a frequent and highly productive phenomenon. For example, the English word cache is transliterated into Japanese as キャッシュ “kyasshu”. In many cases, recent transliterations are not recorded in machine-readable dictionaries, so it is impossible to rely on dictionary lookup to find transliteration equivalents. In this paper we describe a meth...




Publication date: 2012